prio-queue: fold lazy_queue into prio_queue for automatic get+put fusion#2140
prio-queue: fold lazy_queue into prio_queue for automatic get+put fusion#2140spkrka wants to merge 2 commits into
Conversation
518e1cc to
f2d0e8c
Compare
|
The pull request has 687 commits. The max allowed is 30. Please split the patch series into multiple pull requests. Also consider squashing related commits. |
|
There are merge commits in this Pull Request: Please rebase the branch and force-push. |
137dd0a to
29af244
Compare
|
/submit |
|
Submitted as pull.2140.git.1780757885582.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
writes:
> Add prio_queue_size() for callers that need the logical element
> count, since the physical nr may temporarily include a
> pending-removal element.
Many code paths used to learn how many elements it logically has by
directly peeking into .nr member of the prio_queue struct. Now they
should call this new helper function, and you converted some in this
patch.
How can we be sure that all such users of prio_queue has been
converted? Are direct references to .nr member, outside of the
prio-queue.c implementation, all now suspect?
For example, object-name.c:get_oid_oneline() uses a prio-queue
"copy", and loops "while (copy.nr)". In the loop, it calls
pop_most_recent_commit(), which does a get followed by put of its
parents. If the get become hanging (e.g., root commit, causing no
_put() performed in pop_most_recent_commit()), would copy.nr still
remain 1 but logically no elements remain in the queue.
There seem to be other direct peeking of .nr member remaining in the
code. Perhaps the member should be renamed to catch in-flight
topics that add more users of prio-queue that peek into the .nr
member, or something like that. |
50a332a to
bb8b0f7
Compare
|
Kristofer Karlsson wrote on the Git mailing list (how to reply to this email): Junio C Hamano <gitster@pobox.com> writes:
> How can we be sure that all such users of prio_queue has been
> converted? Are direct references to .nr member, outside of the
> prio-queue.c implementation, all now suspect?
You're right, and the patch is thus broken in its current state.
I did a rename of .nr to ._nr on the branch and rebuilt -- that
immediately found several callers I missed:
- object-name.c: get_oid_oneline()
(like you also found)
- fetch-pack.c: mark_recent_complete_commits()
- builtin/last-modified.c: last_modified_run()
- path-walk.c: walk_objects_by_path()
- commit-reach.c: queue_has_nonstale()
The describe.c and show-branch.c callers already compensate for
get_pending in their iteration bounds, but they still reach into
.nr directly.
> Perhaps the member should be renamed to catch in-flight topics
> that add more users of prio-queue that peek into the .nr member,
> or something like that.
Agreed, that's the right fix. I looked for existing ways of marking
fields as private, internal or hidden but the only thing I found was
the convention of using a code comment: /* for internal use only */
I will apply a rename and submit a v2. Perhaps something like
nr_internal to make it look less like a public API. |
|
User |
|
/submit |
|
Submitted as pull.2140.v2.git.1780772477.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
René Scharfe wrote on the Git mailing list (how to reply to this email): On 6/6/26 9:01 PM, Kristofer Karlsson via GitGitGadget wrote:
> Rene's lazy_queue wrapper in describe.c was a clever optimization -- by
> deferring the get, a following put becomes a simple replace, avoiding a full
> remove-rebalance-insert cycle.
>
> It turns out this pattern is so common in git's traversal code that it makes
> sense to fold it into prio_queue itself. Gets and puts are interleaved in
> virtually every commit walk, so the fusion is essentially always a win.
>
> This is mostly a code simplification -- three callers had independently
> reimplemented the same optimization, and they all collapse to plain get+put
> now. The 3-6% speedup on traversal-heavy workloads is a nice bonus.
>
> More details and benchmark numbers in the commit message. Benchmarks were
> run on next which includes kk/commit-reach-optim -- those results represent
> the more realistic end state.
>
> Related to but independent of the cascade sift-down work in
> kk/prio-queue-cascade-sift -- the two can land in either order.
>
> Changes since v1:
>
> * Added a second commit that renames .nr to .nr_internal so that direct
> access from outside prio-queue.c is a compile error. Verified that after
> the rename, only prio-queue.c references nr_internal.
>
> * Added prio_queue_for_each() macro for callers that need to walk all
> elements (describe.c, show-branch.c, commit-reach.c, revision.c,
> negotiator/skipping.c).
>
> * Converted remaining .nr loop conditions to use
> prio_queue_get()/prio_queue_peek() as the loop condition, or
> prio_queue_size() where get/peek isn't suitable.
>
> * Fixed several callers missed in v1 (object-name.c, fetch-pack.c,
> path-walk.c, pack-bitmap-write.c, negotiator/default.c,
> negotiator/skipping.c, revision.c, builtin/last-modified.c).
>
> Kristofer Karlsson (2):
> prio-queue: fold lazy_queue into prio_queue for automatic get+put
> fusion
> prio-queue: rename .nr to .nr_internal to prevent direct access
>
> builtin/describe.c | 70 ++++++++-------------------------
> builtin/last-modified.c | 7 ++--
> builtin/show-branch.c | 24 +++++-------
> commit-reach.c | 24 ++++++------
> commit.c | 11 +-----
> fetch-pack.c | 4 +-
> negotiator/default.c | 4 +-
> negotiator/skipping.c | 12 +++---
> object-name.c | 2 +-
> pack-bitmap-write.c | 10 ++---
> path-walk.c | 8 ++--
> prio-queue.c | 77 ++++++++++++++++++++-----------------
> prio-queue.h | 19 +++++----
> revision.c | 16 ++++----
> t/unit-tests/u-prio-queue.c | 6 +--
> walker.c | 4 +-
> 16 files changed, 129 insertions(+), 169 deletions(-)
>
>
> base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2140%2Fspkrka%2Flazy-prio-queue-pr-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2140/spkrka/lazy-prio-queue-pr-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/2140
>
> Range-diff vs v1:
>
> 1: 29af24445e = 1: 29af24445e prio-queue: fold lazy_queue into prio_queue for automatic get+put fusion
> -: ---------- > 2: bb8b0f78f1 prio-queue: rename .nr to .nr_internal to prevent direct access
>
My earlier attempt in <90270818-c52b-4611-8da2-6cee20628fc2@web.de>
copied the last item to the root and decreased .nr, to allow callers to
scan items and get their count directly.
Checking emptiness by doing the existing calls of prio_queue_peek() and
prio_queue_get() a bit earlier and scanning using a foreach macro are
fine as well and arguably cleaner, at the low cost of having to change
all the callers.
The result is faster than my attempt, but still slower than the current
code in the describe benchmark from 30598ccc4d (describe: use oidset in
finish_depth_computation(), 2025-09-02):
Benchmark 1: ./git_main describe $(git rev-list v2.41.0..v2.47.0)
Time (mean ± σ): 601.7 ms ± 1.9 ms [User: 538.6 ms, System: 47.3 ms]
Range (min … max): 599.3 ms … 606.5 ms 10 runs
Benchmark 2: ./git_auto_replace describe $(git rev-list v2.41.0..v2.47.0)
Time (mean ± σ): 618.0 ms ± 1.1 ms [User: 554.5 ms, System: 47.6 ms]
Range (min … max): 616.7 ms … 620.2 ms 10 runs
Benchmark 3: ./git_fold describe $(git rev-list v2.41.0..v2.47.0)
Time (mean ± σ): 609.9 ms ± 0.8 ms [User: 546.7 ms, System: 47.4 ms]
Range (min … max): 608.8 ms … 611.2 ms 10 runs
Benchmark 4: ./git describe $(git rev-list v2.41.0..v2.47.0)
Time (mean ± σ): 606.1 ms ± 1.2 ms [User: 543.7 ms, System: 46.7 ms]
Range (min … max): 604.7 ms … 609.1 ms 10 runs
Summary
./git_main describe $(git rev-list v2.41.0..v2.47.0) ran
1.01 ± 0.00 times faster than ./git describe $(git rev-list v2.41.0..v2.47.0)
1.01 ± 0.00 times faster than ./git_fold describe $(git rev-list v2.41.0..v2.47.0)
1.03 ± 0.00 times faster than ./git_auto_replace describe $(git rev-list v2.41.0..v2.47.0)
git_auto_replace: <90270818-c52b-4611-8da2-6cee20628fc2@web.de> and
revert of 08bb69d70f (describe: use prio_queue_replace(), 2025-08-03)
git_fold: this series
git: this series and the patch below
My attempt leaves performance on the table by using a bool. Using
an unsigned for the flag is measurably faster -- but still slower
than your series here.
Calling flush_get() later, when we know that we have items and a
compare function, is cleaner, as we never need it in LIFO mode, and
is also slightly faster (patch below).
Still there's this 1% performance gap to the current code that I
don't understand. Do you see it as well?
René
---
prio-queue.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/prio-queue.c b/prio-queue.c
index d11ca6ac36..45709187d3 100644
--- a/prio-queue.c
+++ b/prio-queue.c
@@ -100,24 +100,23 @@ static void sift_down_root(struct prio_queue *queue)
void *prio_queue_get(struct prio_queue *queue)
{
- flush_get(queue);
-
if (!queue->nr_internal)
return NULL;
if (!queue->compare)
return queue->array[--queue->nr_internal].data; /* LIFO */
+ flush_get(queue);
queue->get_pending = 1;
return queue->array[0].data;
}
void *prio_queue_peek(struct prio_queue *queue)
{
- flush_get(queue);
-
if (!queue->nr_internal)
return NULL;
if (!queue->compare)
return queue->array[queue->nr_internal - 1].data;
+
+ flush_get(queue);
return queue->array[0].data;
}
|
|
Kristofer Karlsson wrote on the Git mailing list (how to reply to this email): On Sun, 7 Jun 2026 at 09:30, René Scharfe <l.s.r@web.de> wrote:
>
> Calling flush_get() later, when we know that we have items and a
> compare function, is cleaner, as we never need it in LIFO mode, and
> is also slightly faster (patch below).
Thanks for the benchmark and the suggestion to move flush_get()
below the LIFO check - that's cleaner since LIFO never sets
get_pending.
One edge case to note: without a second !nr_internal check after
flush_get(), two consecutive get() calls on a single-element queue
will return stale data instead of NULL. I went a step further and
inlined the flush logic directly into get()/peek(), which also
removes the forward declaration.
> Still there's this 1% performance gap to the current code that I
> don't understand. Do you see it as well?
Yes, I saw a similar trend on my laptop (Core Ultra 7 155U),
but with very high variance - the results were too noisy to be
conclusive even with 20+ runs.
On an idle server (Xeon @ 2.20GHz) with much lower variance, all
three variants (v2 as posted, your patch, and the inlined version)
came out ~1.3% faster than the baseline across 30 interleaved
runs (p < 0.01). So it seems CPU-dependent - possibly branch
prediction or code alignment differences between microarchitectures.
Results from the idle server (30 interleaved runs, paired t-test):
Variant Avg SE vs baseline 95% CI p
-------------- ---------- -------- ------------ ---------------- ----------
baseline 2002.5ms 9.2ms (baseline)
v2-posted 1976.6ms 3.2ms -1.29% -41 to -11ms 0.0019
v2-rene 1977.7ms 3.1ms -1.24% -42 to -8ms 0.0071
v2-latest 1975.3ms 1.8ms -1.36% -46 to -9ms 0.0069
baseline: 9ac3f193c0 (The 11th batch)
v2-posted: v2 as sent to the list
v2-rene: v2 + your flush_get patch
v2-latest: v2 + inlined flush (for v3)
Will send a v3 with the inlined flush shortly.
- Kristofer |
d76c3c4 to
033215e
Compare
|
/submit |
|
Submitted as pull.2140.v3.git.1780832592.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): "Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
writes:
> Changes in v3:
>
> * Adopted Rene's suggestion to move the flush logic below the LIFO
> early-return (LIFO mode never sets get_pending, so flushing there is a
> no-op).
Sensible.
> * Went a step further and inlined the flush logic directly into get() and
> peek(), eliminating the flush_get() helper and its forward declaration of
> sift_down_root().
Hmph, unless there is a reason to allow the copies in get() and
peek() to deviate from each other, e.g., what flush_get() had to do
inside get() and peek() were slightly different, I am not sure if
this is a good move. I do not know if the slight difference of the
"inlined" logic we have in the patch between the one in get() and
the other one in peek() has merit, either. It certainly lets you
avoid an unnecessary clearing of the get_pending bit (when a get was
pending but the queue has more items to yield) immediately followed
by turning it back on again (which happens always unless the
function makes an early return for an empty queue) in get(), which
will never happen in flush() that will always clear the bit before
it returns, but is such an avoidance of an assignment really worth
it? I suspect that with the static inline version of flush_get(),
compilers are smart enough to optimize it away, but I dunno.
> void *prio_queue_get(struct prio_queue *queue)
> {
> if (!queue->nr)
> return NULL;
> if (!queue->compare)
> ++ return queue->array[--queue->nr].data;
> ++
> ++ if (queue->get_pending) {
> ++ if (!--queue->nr) {
> ++ queue->get_pending = 0;
> ++ return NULL;
> ++ }
> ++ queue->array[0] = queue->array[queue->nr];
> ++ sift_down_root(queue);
> ++ }
> +
The above is from [1/2] (this is a minor point, but flipping the
order of two patches to make the "nr_internal clean-up" as a
preliminary step might have made commenting on this part easier to
read). I wondered if it is easier to understand if the first early
return is guarded by a conditional that takes get_pending into
account.
if (queue->nr_internal <= queue->get_pending)
return NULL;
As I said, since the above hunk is immediately followed by an
unconditional assignment of "queue->get_pending = 1", clearing
get_pending = 0 only when we leave inside the if() block works as an
optimization that is not available on the peek() side. But with the
"ah the queue is empty already, the queue->ne == 1 is due to the
lazy get that did not rebalance" tweak, it would become unneeded, I
think.
> + void *prio_queue_peek(struct prio_queue *queue)
> + {
> + if (!queue->nr_internal)
> return NULL;
> if (!queue->compare)
> + return queue->array[queue->nr_internal - 1].data;
> +
> + if (queue->get_pending) {
> + queue->get_pending = 0;
> +- if (!--queue->nr)
> ++ if (!--queue->nr_internal)
> + return NULL;
> +- queue->array[0] = queue->array[queue->nr];
> ++ queue->array[0] = queue->array[queue->nr_internal];
> + sift_down_root(queue);
> + }
This is from [2/2]; the same
if (queue->nr_internal <= queue->get_pending)
return NULL;
applies here, I think. |
|
Junio C Hamano wrote on the Git mailing list (how to reply to this email): Kristofer Karlsson <krka@spotify.com> writes:
> Agreed, that's the right fix. I looked for existing ways of marking
> fields as private, internal or hidden but the only thing I found was
> the convention of using a code comment: /* for internal use only */
>
> I will apply a rename and submit a v2. Perhaps something like
> nr_internal to make it look less like a public API.
I think we often use trailing underscore (e.g., "n_") to mark
variables for "not to be used casually, there are better ways to
access this information", which pre-ANSI C people probably used
leading underscore (e.g. "_n") for.
This is often used in callback functions where the types of their
formal parameters are specified by the API and use of them with
casting at every use site is awkward. For example, qsort() and
friends often take a pointer to the location that stores the value
to be compared, but it is awkward, so we do cast just once upfront
like so:
static int compare_callback(const void *a_, const void *b_)
{
const a_type a = *((const a_type *)a_);
const a_type b = *((const a_type *)b_);
... use values 'a' and 'b', without having to cast a_ or b_ ...
return a - b;
}
I think the technique/convention can be used in a similar way for
"this is hidden and unless you can tell if you should be using it
directly, you probably shouldn't" kind of structure members.
So, nr_internal is perfectly fine, but if you find it too long, nr_
is probably just as good. |
Rename the .nr member to .nr_ so that callers outside prio-queue.c that directly reference .nr get a compilation error. This catches both existing misuse and future in-flight topics. Add prio_queue_size() for callers that need to know the element count and prio_queue_for_each() for callers that need to walk all elements. Convert all external .nr users: - Loop conditions: use prio_queue_size(), prio_queue_get(), or prio_queue_peek() as the loop condition - Array iterations: use prio_queue_for_each() Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Defer the actual removal in prio_queue_get() until the next operation. If that next operation is a prio_queue_put(), the removal and insertion are fused into a single replace — writing the new element at the root and sifting it down — which avoids a full remove-rebalance-insert cycle. This matches the dominant usage pattern in git's commit traversal: get a commit, then put its parents. The first parent insertion after each get is now a replace operation automatically. This generalizes the lazy_queue pattern from builtin/describe.c (introduced in 08bb69d) into prio_queue itself. Three callers independently implemented the same get+put fusion: - builtin/describe.c had a full lazy_queue wrapper - commit.c:pop_most_recent_commit() used peek+replace - builtin/show-branch.c:join_revs() used peek+replace All three now collapse to plain _get() and _put(), with the data structure handling the fusion internally. This simplifies callers and means every prio_queue user gets the optimization for free without needing to implement it manually. Remove prio_queue_replace() since no external callers remain. Benchmarked on a 1.8M-commit monorepo (30 interleaved runs, paired t-test, Xeon @ 2.20GHz): Code paths that previously did eager get+put (new optimization): Command base patched change p merge-base --all A A~1000 3828ms 3725ms -2.69% 0.0001 rev-list --count A~1000..A 3055ms 2986ms -2.27% 0.0601 log --oneline A~1000..A 3408ms 3350ms -1.71% 0.0482 Code paths that already had manual get+put fusion (expect neutral — the optimization moves into prio_queue but the number of heap operations stays the same): Command base patched change p show-branch A A~1000 9156ms 9127ms -0.32% 0.3470 describe (4751 revs, 81K repo) 1983ms 1963ms -1.02% <0.001 No regressions in any scenario. Suggested-by: René Scharfe <l.s.r@web.de> Signed-off-by: Kristofer Karlsson <krka@spotify.com>
033215e to
a3f4cb5
Compare
|
/submit |
|
Submitted as pull.2140.v4.git.1780945851.gitgitgadget@gmail.com To fetch this version into To fetch this version to local tag |
Rene's lazy_queue wrapper in describe.c was a clever
optimization -- by deferring the get, a following put becomes
a simple replace, avoiding a full remove-rebalance-insert
cycle.
It turns out this pattern is so common in git's traversal code
that it makes sense to fold it into prio_queue itself. Gets
and puts are interleaved in virtually every commit walk, so
the fusion is essentially always a win.
This is mostly a code simplification -- three callers had
independently reimplemented the same optimization, and they
all collapse to plain get+put now. The 1.7-2.7% speedup on
traversal-heavy workloads is a nice bonus.
More details and benchmark numbers in the commit message.
Related to but independent of the cascade sift-down work in
kk/prio-queue-cascade-sift -- the two can land in either
order.
Changes in v4:
Thanks Junio for review, applied all suggestions.
Renamed .nr_internal to .nr_
Restored flush_get() as a static inline helper instead of
inlining the flush logic into get() and peek().
Guard empty-queue check with nr_ <= get_pending.
Flipped commit order: the rename/accessor commit is now
first, and the behavioral fusion change is second. This
was partly messy -- the first rename commit introduces
some ugly intermediate code (e.g. describe.c's
prio_queue_for_each with a skip variable) that gets
cleaned up in commit 2 when the lazy get makes it
unnecessary.
Changes in v3:
Adopted Rene's suggestion to move the flush logic below
the LIFO early-return (LIFO mode never sets get_pending,
so flushing there is a no-op).
Went a step further and inlined the flush logic directly
into get() and peek(), eliminating the flush_get() helper
and its forward declaration of sift_down_root().
Updated benchmark numbers with more rigorous methodology:
30 interleaved runs with paired t-test on an idle server.
Split results into code paths that already had manual
fusion (neutral) vs code paths that benefit from the new
automatic fusion (1.7-2.7% improvement).
Changes in v2:
Added a second commit that renames .nr to .nr_internal so
that direct access from outside prio-queue.c is a compile
error. Verified that after the rename, only prio-queue.c
references nr_internal.
Added prio_queue_for_each() macro for callers that need to
walk all elements (describe.c, show-branch.c, commit-reach.c,
revision.c, negotiator/skipping.c).
Converted remaining .nr loop conditions to use
prio_queue_get()/prio_queue_peek() as the loop condition,
or prio_queue_size() where get/peek isn't suitable.
Fixed several callers missed in v1 (object-name.c,
fetch-pack.c, path-walk.c, pack-bitmap-write.c,
negotiator/default.c, negotiator/skipping.c, revision.c,
builtin/last-modified.c).
CC: René Scharfe l.s.r@web.de
cc: Kristofer Karlsson krka@spotify.com